Towards Zulu corpus clean-up, lexicon development and corpus annotation by means of computational morphological analysis
Similar resources
Corpus-Induced Corpus Clean-up
We explore the feasibility of using only unsupervised means to identify non-words, i.e. typos, in a frequency list derived from a large corpus of Dutch and to distinguish between these non-words and real-words in the language. We call the system we built and evaluate in this paper CICCL, which stands for ‘Corpus-Induced Corpus Clean-up’. The algorithm on which CICCL is primarily based is the an...
Ukwabelana - An open-source morphological Zulu corpus
Zulu is an indigenous language of South Africa, and one of the eleven official languages of that country. It is spoken by about 11 million speakers. Although it is similar in size to some Western languages, e.g. Swedish, it is considerably under-resourced. This paper presents a new open-source morphological corpus for Zulu named Ukwabelana corpus. We describe the agglutinating morphology of Zul...
A Large Semantic Lexicon for Corpus Annotation
Semantic lexical resources play an important part in both corpus linguistics and NLP. Over the past 14 years, a large semantic lexical resource has been built at Lancaster University. Different from other major semantic lexicons in existence, such as WordNet, EuroWordNet and HowNet, etc., in which lexemes are clustered and linked via the relationship between word/MWE senses or definitions of me...
Automatic Lexicon Enhancement by Means of Corpus Tagging
Using specialised text corpus to automatically enhance a general lexicon is the aim of this study. Indeed, having lexicons which offer maximal cover on a specific topic is an important benefit in many applications of Automatic Speech and Natural Language Processing. The enhancement of these lexicons can be made automatic as big corpora of specialised texts are available. A syntactic tagging pro...
Morphological Annotation of the Lithuanian Corpus
As the development of information technologies makes progress, large morphologically annotated corpora become a necessity, as they are necessary for moving onto higher levels of language computerisation (e. g. automatic syntactic and semantic analysis, information extraction, machine translation). Research of morphological disambiguation and morphological annotation of the 100 million word Lith...
Journal
Journal title: South African Journal of African Languages
Year: 2011
ISSN: 0257-2117,2305-1159
DOI: 10.1080/02572117.2019.12063275